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Bayesian, classical, and extended maximum likelihood approaches to estimation of 
^ ■ upper limits in experiments with small numbers of signal events are surveyed. The 

discussion covers only experiments whose outcomes are well described by a Poisson 
' statistic. A new approach, based on the statistical significance of a signal rather 

■ than on the number of events in the signal region, is proposed. A toy model and an 

Qs^ , example of a recent search for the lepton number violating decay r — > /X7 are used 

On I to illustrate application of the discussed techniques. 



1 Introduction 



Searches for rare signals often fail to detect a signal of sufficient statistical 
significance and thus face the need to set an upper limit on the signal rate. 
Unfortunately, there is no standard prescription for setting such limits, and a 
number of techniques have been employed in the past to meet this challenge. 
This problem is particularly important for the particle physics community. 
Upper limits on rare and forbidden decays can provide valuable constraints 
on physics beyond the Standard Model of Electroweak Interactions. 

This paper is stimulated by the observation that often authors do not pay 
enough attention to the choice of a procedure for upper limit estimation. 
Furthermore, many experimentalists do not provide a sufficient description of 
the procedure used in their analysis technique, assuming that the reader is 
smart enough to figure out the details. An example given in Section 2.2 shows 
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that, in some situations, the choice of a procedure can change the value of an 
upper limit by an order of magnitude, which suggests that this problem should 
not be taken lightly. 

In Section 2 the most popular approaches to upper limit estimation are sur- 
veyed and criticism is provided, where applicable. In Section 3.2 definitions of 
the statistical significance of a signal are discussed and a new technique, which 
estimates upper limits using these definitions, is proposed. Unlike commonly 
used procedures, this approach does not just count the number of events in the 
signal region, but takes into account statistical fluctuations of the background 
as well. In Section 4 various techniques are compared using a toy model. In 
Section 5 an example of a recent search for the lepton number violating decay 
T ^ [25], performed by the CLEO collaboration, is given. 

The emphasis is made on estimation of Tipper limits, though the discussion 
can be easily generalized to include construction of confidence intervals whose 
lower bound is not constrained to zero. 

There is no discussion of systematic effects in this paper. Throughout the 
paper the number of observed events, the expected background rate and, if 
applicable, the coordinates of observed events are assumed to be measured 
with perfect accuracy. 



2 Commonly Used Techniques for Upper Limit Estimation 

2.1 Bayesian Approach 

In a Bayesian approach one has to assume a prior probability density function 
(pdf ) of an unknown parameter and then perform an experiment to update 
the prior distribution. A prior pdf reflects knowledge that is available to an 
experimentalist before an experiment is performed. The updated prior is called 
the posterior pdf and is used to draw inference on the unknown parameter. 
This updating is done with the use of Bayes' Rule. For the moment let us 
ignore the issue of background, i.e., let us assume that the background rate is 
measured very accurately and thus can be treated as a known constant. Then 
the only unknown parameter is the signal rate s. Bayes' Rule gives: 



where n represents the number of observed events, f{n\s) is the conditional 
probability to observe n events, given the signal rate s, tt{s) is the prior pdf. 
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and 7r(s|n) is the conditional posterior pdf. Now, for any given confidence 
level, one can compute a Bayesian confidence interval for the signal rate s. For 
upper limit estimation the natural choice of the Bayesian confidence interval 
is of the form (0, Sq). Here, Sq denotes an upper limit and can be found from 
the equation: 



The confidence level is denoted by (l — a) following the conventional statistical 
notation. A nice feature of the Bayesian approach is that the zero value of an 
upper limit sq always corresponds to the zero value for the confidence level 
(l — a). As you will see below, this is not necessarily true for the classical 
approach. 

The most important step here is to define a prior distribution of the parameter. 
Naturally, this is the step which brings most of controversy (and sometimes 
confusion) into Bayesian methods. Many statistical textbooks treat a prior pdf 
as a purely subjective assumption which is based on experimenter's belief. At 
the same time there are authors [3-6] who advocate the objectivity of prior 
assumptions. In particular, Jaynes [5] stated the "basic desideratum" of the 
objective approach as follows: in two problems where we have the same prior 
information, we should assign the same prior probabilities. These authors have 
offered a number of mathematical procedures that can be used to convert prior 
knowledge into an exact formula for the prior distribution. Another important 
question is whether one should assume an informative prior, i.e., a prior which 
incorporates results of previous experiments, or a non-informative prior, i.e., 
a prior which claims total ignorance. Naively, it would seem unnatural to use 
non-informative priors since the power of the Bayesian approach comes from 
the fact that we update our prior knowledge rather than start every analysis 
from scratch. In particle physics the major objection against informative priors 
is based on the following argument: if we assume a prior which incorporates 
results of earlier experiments, then our experiment will not be independent 
of those and thus we will not be able to combine our results with the results 
of previous experiments just by taking a weighted average. So far, particle 
physicists have been largely ignoring informative priors. Thus, the discussion 
below covers only those Bayesian methods that assume a non-informative prior 
pdf for the positive parameter of a Poisson distribution. 

In the absence of background the conditional pdf f{n\s) is given by: 
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Bayes and Laplace [1,2], who pioneered statistical research employing Bayesian 
methods, stated that the non-informative prior for any parameter must be flat. 
This conclusion was not based on any strict mathematical argument, it was 
merely a product of their intuition. Modern advocates [8-11] of this approach 
do not offer any mathematical explanation either, they just consider a flat 
prior pdf as the most natural choice one can make. As natural as this assump- 
tion may seem, there are several objections. The most obvious argument is 
that if one can assume a flat distribution of an unknown parameter, then one 
can also assume a flat distribution for any function of this parameter, and 
these two assumptions are clearly not identical. If an experiment is measuring 
the mean of the Poisson statistic (3), then one can argue that the most natural 
candidate for the unknown parameter is the signal rate s, and we have no phys- 
ical reasons to consider any functions of this parameter except the parameter 
itself. In other experiments the situation may be not so simple. For example, 
if an experiment is measuring a neutrino mass, then typically the measured 
quantity is not the neutrino mass itself, but the neutrino mass squared. In this 
situation it's not clear whether one should choose mass or mass squared as a 
candidate for the flat distribution. Jeffreys [3] resolved this problem by intro- 
ducing an invariate prior pdf 1/6 which, he stated, was a valid non-informative 
prior for all problems where the unknown parameter 6 could vary from to 
-|-oo. His choice was mostly motivated by the fact that d6/6 oc d6"/6^, i.e., 
the pdf of 6 stays invariant under any power transformation. All non-power 
functions of 6 were rejected by Jeffreys as non-physical under the assump- 
tion that the parameter ^ is a dimensional quantity. This argument was put 
on a more rigorous mathematical basis by Jaynes [5] who stipulated that a 
prior pdf has to stay invariant under any symmetry transformation^ that 
does not change the physics of an experiment. His conclusion was similar to 
that of Jeffreys, namely that the non-informative prior for a Poisson statis- 
tic (3) has to be proportional to 1/s. An alternative approach was developed 
by Box and Tiao [7] who introduced the notion of a data-translated likelihood. 
In their approach a prior pdf is non- informative if the location, but not the 
shape, of the corresponding posterior likelihood^ is determined by the un- 
known parameter. Thus, the location of the posterior likelihood is completely 



^ Strictly speaking, Jaynes' argument is not applicable to the Poisson distribu- 
tion (3) with a non-dimensional mean rate s. Originally Jaynes' non-informative 
prior was derived for the Poisson pdf f{n\s) = e~*'*(st)"/n!, where n is the number 
of counts, t is the counting time, and s is the signal rate per unit time. However, 
the requirement of dimensionality seems to be somewhat arbitrary. For example, in 
the above formula t can be the amount of statistics accumulated in the experiment 
and s can be the signal rate normalized to this amount of statistics. A prior pdf has 
to stay invariant under the transformation s' = qs, t' = t/q, i.e., Jaynes' argument 
is applied. 

^ As usually, a likelihood function is defined by swapping argument and parameter 
in the expression of the corresponding pdf. 
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defined by data, while its shape had been determined before the data were 
seen (hence the name "data-translated" likelihood). It may not be possible to 
construct a data-translated likelihood for every distribution. In this case one 
can use Taylor expansion to construct an approximate data-translated likeli- 
hood. In particular, a non-informative prior which produces an approximate 
data-translated likehhood for the Poisson statistic (3) is given by 

In the presence of background, the Poisson pdf (3) has to be modified to 
account for the non-zero background rate b: 

/W.) = e-(^+'')^^. (4) 



If the background rate is accurately measured, it can be treated as a known 
constant. In this case an argument similar to that of Ref. [5] gives l/{s + b) and 
an argument similar to that of Ref. [7] gives l/^/s + b for the non-informative 
prior. In general, for the prior pdf 

7r(s) oc - — \— ; < m < 1 ; (5) 



the posterior distribution is given by 



Trisln) = — ; — , 6 



where 

oo 

r(p,/x) = J sP-^e-'ds; p > 0; > 0; (7) 



is an incomplete gamma-function. Substituting the posterior pdf (6) into 
Eqn. (2), we obtain: 

1-.^!- ^^"-"^ + ^'^° + ^^ . (8) 

r(n-m + 1,6) ^ ^ 



A gamma-function is not well-defined if its first argument is equal to zero or a 
negative integer. Thus, at n = (no events observed) and m — 1 (the l/{s + b) 
prior) Eqn. (8) fails to find an upper hmit. Under the flat prior m — 0, Eqn. (8) 
turns into the formula 



V" p-(.so+b) (so+br 

En 
k=0 A;! 
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which was adopted by Particle Data Group [13]. Here and below, 0° is by 
definition equal to one. 

If the background rate is measured with large uncertainty, then the situation 
is more complicated. The usual approach is to modify the Poisson statistic (4) 
as: 



where f{b) is the measured or predicted pdf of the background rate b. The 
argument of the previous paragraph, leading to the + b) and 1 / \/s + b 
expressions for the non-informative prior, is not necessarily valid in this case. A 
more consistent Bayesian approach would be to derive a joint non-informative 
prior 7r(s, b), thus treating both the signal and background rates as unknown 
parameters. An example of such derivation is given in Ref. [6]. A discussion 
of intricacies, that might arise from inclusion of background uncertainty into 
the upper limit calculation, is beyond the material covered in this paper. 

2.2 Classical Approach 

The classical (or frequentist) approach is traditionally interpreted in the fol- 
lowing way: if a (1 — a) classical confidence set is constructed for the unknown 
parameter, then the probability for this confidence set to cover the true value 
of the parameter equals (1 — a), that is, if an infinite number of identical 
independent experiments is performed and for each of these a (1 — a) confi- 
dence set is constructed, then 100(1 — a)% of these confidence sets will and 
100q;% of these confidence sets will not contain the true value of the param- 
eter. Rules for construction of classical intervals were outlined in a famous 
work [15] by Neyman. Here 1 treat the terms "classical" and "frequentist" as 
equivalent. The implication is that the classical approach is inevitably con- 
nected to the concept of an identical independent experiment. Sometimes this 
concept is misunderstood, and are curious attempts to treat certain problems 
in the "classical" vein while, in fact, this treatment has nothing to do with 
the frequentist approach. An example of such misunderstanding is shown in 
this Section. 

A confidence set for the unknown signal rate s is typically constructed [16] as 
a confidence interval Si < s < S2 satisfying: 




(10) 
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For upper limit estimation it is natural to consider intervals < s < Sq, where 
So is the value of an upper limit. Eqn. (11) is thus reduced to: 



1 - 



a = 




-(sa+b) (^0 + b) 



(12) 



where f{k\so) was replaced by its definition (4). 

Eqn. (12) looks similar to the Bayesian formula (9), except that the denomi- 
nator is now absent. The denominator in Eqn. (9) represents the probability 
of observing n or less background events in the signal region, and thus, it is 
always less than 1 except at 6 = 0. Therefore, for the non-zero background 
rate, i.e., b ^ 0, the classical approach (12) always provides a smaller value of 
an upper limit than the Bayesian approach (9) with a flat prior. The differ- 
ence becomes significant when the denominator in Eqn. (9) is small, i.e., when 
the number n of events observed in the signal region is small as compared 
to the expected background rate b. For example, for the observed number of 
events in the signal region n = 3 and the expected background rate b = 6.5 
the classical approach (12) gives an upper limit sq = 0.18 at 90% confidence 
level, while the Bayesian approach (9) with a fiat prior gives Sq = 3.39, i.e., a 
difference of more than an order of magnitude. Another feature of Eqn. (12) is 
that now a zero upper limit: Sq — 0, does not give you a zero confidence level. 
When the expected background rate b is large, one can achieve a situation 
when 1 — q; < 1 — J2k=o^~^b'' /^^-^ i-^-) formula (12) gives an unphysical nega- 
tive value of an upper limit. Advocates of the Bayesian method consider this 
as an undesirable feature of the classical approach. But the failure to set an 
upper limit is not necessarily a bad feature. It just implies that the observed 
outcome of an experiment is highly improbable, and one should question the 
experimental technique that was used to measure the signal and the back- 
ground. 

Zech [12] made an attempt to derive the Bayesian formula (9) using the clas- 
sical approach. To arrive at Eqn. (9), Rcf. [12] postulates that the number of 
background events cannot exceed the number n of observed events and there- 
fore one has to renormalize the probabihty (12) according to this constraint. 
This approach misunderstands the concept of frequentist coverage. Eqn. (12) 
answers the question: what is the probability of observing n or more events in 
an identical independent experiment^ This implies that the number n of events 
cannot be fixed at any particular value. If you fix n, then you confine yourself 
to this particular drawing; the concept of an identical independent experiment 
is therefore inapplicable. In this case a binomial distribution should be used 
instead of a Poisson pdf. 

Confidence intervals are not uniquely defined by Eqn. (11) unless one imposes 
specific criteria on their construction. In the situation discussed above, the 
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uniqueness was introduced by requiring that the confidence set must be an 
interval (O..So). One can choose another requirement and obtain a different 
confidence set. In general, if the problem is reduced to one variable and one 
unknown parameter, then confidence intervals are obtained via construction 
of confidence belts [20]. The idea behind this technique is the following. For 
every assumed value of the signal rate s, one finds an acceptance interval 
ni{s) < n < 712(5) which satisfies: 

l-a=^f{n\s). (13) 

n=ni 

Due to the discrete nature of the Poisson distribution, it is usually impossible 
to find values of rii and n2 that satisfy Eqn. (13). The standard solution is 
to stay on the conservative side and to search for an acceptance interval that 
gives at least the required coverage: 

n2 

l-a< Y.fin\s). (14) 

n=ni 



The obtained functions ni(s) and n2{s) define two curves on the s-vs-n plane. 
Then, for every value of n, one obtains an interval Si{n) < s < S2{n) which 
lies between these two curves. The obtained interval is a (1 — a) confidence 
interval for the specific value of n. The order in which values of n are added 
to the acceptance region (rii, 712) for a specific value of s is called an ordering 
principle. Thus, every ordering principle corresponds to a specific set of confi- 
dence intervals (si, S2) indexed by the observation variable n. The well-known 
examples arc an old paper [18] by Crow and Gardner and a recent paper [17] 
by Fcldman and Cousins. Rcf. [18] minimizes the length of the acceptance 
interval (77-1,^2) defined by Eqn. (14). Ref. [17] employs an ordering princi- 
ple based on likelihood ratios. The latter was adopted by the last release of 
the Particle Data Group Review [14]. In both approaches an experimentalist 
does not have to decide whether she/he wants to quote an upper limit or a 
confidence interval: she/he simply applies the chosen procedure to construct a 
confidence interval (si, S2). If the lower bound turns out to be strictly equal to 
zero: Si = 0, then an upper limit is quoted, and if the lower bound is positive: 
Si > 0, then a confidence interval is quoted. This versatile procedure, however, 
has one subtle problem. The fact that the constructed 90% confidence interval 
has a non-zero lower bound does not guarantee that a signal of high statistical 
significance is observed. There is nothing surprising about that. A measured 
signal rate is usually quoted if the statistical significance of the signal exceeds 
3 which corresponds to 99.87% of the area under a Gaussian peak. At the 
same time, typical confidence levels used to quote upper limits are 90% and 
95%. For example, at 6 = 1 and n = 3 the 90% confidence interval obtained by 
the procedure of Ref. [17] equals (0.10, 6.42). At the same time, the statistical 
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significance a calculated as l/\/27r e~^^/'^dx = Sfcln 6"''^*^/^' is only 1.4. 
Thus, the non-zero lower bound has no clear interpretation: it is a vague in- 
dication that a clear signal might be observed in a future experiment. On the 
contrary, if one chooses to construct a one-sided interval and thus quote an 
upper limit, this clearly expresses the fact that a signal of sufficient statistical 
significance was not observed and therefore the signal rate is believed to be 
consistent with zero. 



2.3 Unhinned Extended Maximum Likelihood Fit 



Unbinned extended maximum likelihood fits [21,22] have become popular in 
the past few years. This is an excellent analysis tool for rare signal searches 
since, unlike a standard maximum likelihood, an extended maximum like- 
lihood correctly incorporates the Poisson error on the number of observed 
events. Thus, when one observes contributions from two processes (e.g., signal 
and background), one should use an extended unbinned likelihood: 

L{s,h)^——Y{{sS, + hB,) (15) 
■ 1=1 



instead of a standard likelihood: 

TV 

L{fs) = \{{fsS^ + {l-fs)Bi) . 
1=1 

Here Si and Bj represent the signal and background spatial pdf 's, respectively. 
N is the total number of events observed in the signal region and in the vicinity, 
and s and b are the signal and background rates, respectively. 

It has also become a common practice to extract an upper limit value by 
integrating a likelihood function: 



where Sq, as usually, denotes the value of an upper limit. The likelihood L{s) 
is typically obtained by integrating the two-dimensional likelihood L(s, b) over 
the parameter space b: 

oo 

L{s) = J L{s,b)db . (17) 
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This technique is apphed, for instance, to estimate upper hmits in rare B decay 
studies [23,24]. It is important to reahze that the integration of the hkehhood 
imphcitly uses the Bayesian approach. As it was pointed out by Cousins [19], 
in particle physics, the prior is almost always taken uniform (where non-zero), 
although this assumption goes unemphasized by those who merely report that 
they "integrated the likelihood function". 

To understand this comment, one should recall the definition of a likelihood 
function: 

L{s,h) = f{x\sM , (18) 



where x = {N,xi,X2, .■.,xn} is a set of observables, the quantity Xi {i = 
1,...,N) is the coordinate of the ith event, the symbol / denotes the cor- 
responding pdf, and a vertical line | always implies conditional distribution. 
Applying Bayes' Theorem, one obtains: 

Tf ,N f{s,b\x)f{x) 



fis,b) 

If signal and background are assumed independent[^, their joint pdf must 
factorize: f{s,b) = f{s)f{b). Multiplying both sides of the above equation by 
and integrating over b, one obtains: 

L{s,b)f{b)db= — . 

f{s) 

The quantity on the right side of the equation can be recognized as L{s). 
Therefore, 

oo 

L{s) = J L{s,b)f{b)db . (19) 





In the small signal limit Si <^ Bi, the likelihood (15) is expressed as: 

L{s,b) = '-—b^UB.. 

■ 1=1 

Due to the fact that the contributions from the signal and background rates s 
and b decouple, the background pdf f{b) in Eqn. (19) cannot change the ratio 

^ The assumption of their independence seems natural. However, Prosper [6] de- 
rived a non-informative joint prior f{s, b), where the signal and background contri- 
butions are non-factorizable. In fact, this assumption is not as obvious as it may 
seem. 
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Jq^ L{s)ds/ L{s)ds. But when the approximation Si <^ Bi is no longer 
vahd, this is not so. Thus, Eqn. (17) treats the pdf f{b) as prior knowledge 
and assumes that it is flat in the interval (0, +00). 

Furthermore, the posterior distribution f{s\x) can be represented as: 

Hs)f{s) 



f(s\x) = 



which is similar to the Bayesian formula (1). Thus, Eqn. (16) implicitly as- 
sumes a flat prior f{s) and integrates f{s\x) oc L{s) to extract an upper limit. 
When the experiment is dominated by background, i.e.. Si -C Bi, the likeli- 
hood (17) is proportional to and the extracted value of an upper limit is 
equal to 2.3 at 90% confidence level. The approximation Si <^ Bi is equivalent 
to the situation when no events are observed in the signal region (n = 0), but 
the expected number of background events is non-zero (6 7^ 0). Under these 
conditions, the Bayesian method (9) with a flat prior gives the e"'* behavior 
as well. Neither the classical approach, nor the Bayesian methods with alter- 
native priors give the behavior for n = and b ^ 0; all these techniques 
give smaller upper limit values. Thus, the integration of the likelihood and 
the Bayesian method with a flat prior give the most conservative estimates in 
background-dominated analyses. 

One can easily avoid the implicit use of the Bayesian method and set an upper 
limit using the frequentist deflnition. It is straightforward to implement this 
approach by running a Monte Carlo job. First, the observed data Xobs is fltted 
to the likelihood function (15) to extract estimates Sobs and bobs of the true 
signal and background rates. Then, for every assumed value of an upper limit 
So, a Monte Carlo sample, consisting of a large number of experiments, is 
generated. For each experiment the number of signal and background events 
are generated assuming Poisson distributions: 

/(s)=e-°|; /(6) = e-''-^; N ^ s + b; 

and the coordinates of events are generated, based on the spatial pdf's Si 
and Bi. Then the outcome of every experiment is fitted to the same hkehhood 
function (15) and the distribution of measured signal rates /(smeas) is plotted. 
The confidence level corresponding to this value of sq is estimated as: 

1 Isnh, f {^meas)dSjyieas /r.n\ 

Jo J meas I '^'^ meas 



Thus, a value of Sq is chosen that gives the required confidence level (1 — 
a). This agrees with the frequentist approach when an experimentalist draws 
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inference about an unknown parameter on the basis of observed data only, 
without making any subjective assumption. 



3 Calculation of Upper Limits, Based on Statistical Significance of 
a Signal 

The Bayesian and classical approaches described in Sections 2.1 and 2.2 are 
based on the conditional probability f{n\s), i.e., the probability to observe n 
events in the signal region, given the signal rate s. However, the number of 
events observed in the signal region by itself is not so important, since the 
usual goal of an experiment is not to count events in the signal region, but 
to observe a signal of high statistical significance. These approaches do not 
take into account the fact that the number of background events fluctuates 
too. For example, the classical method (12) estimates the confidence level as 
a probability to observe a larger number of events in the signal region than 
the number observed in the experiment, given the assumed value s of the 
signal rate. But, at the same time, the number of background events in the 
sideband can fluctuate up to high values or down to zero. The former implies 
that no signal is observed, and the latter implies that a very clean signal 
is observed. The methods based on the pdf f{n\s) take into account both 
these possibilities and thus do not distinguish between signals of high and low 
statistical significance. 

Therefore, one should replace the conditional pdf f{n\s) with the conditional 
pdf f{a\s), where a represents the statistical significance of a signal. Before 
we proceed to derivation of the related mathematical formalism, we need to 
discuss various definitions of statistical significance. 

3.1 Definition of Statistical Significance 

If n events are observed in the signal region and b is the estimated background 
rate, then the statistical significance cr of a signal is defined by 




that is, it represents the probability of observing n or a larger number of 
background events in an identical independent experiment. The background 
rate b can be estimated in a number of ways. If the background rate is esti- 
mated independently of the data seen, e.g., from a Monte Carlo analysis or 
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from another data sample that is known to contain no signal events, then it 
is common to assume that the estimated background rate is proportional to 
the rate observed in the independent experiment: 

b — CindP'ind , (22) 

where riind is the number of events observed in the independent experiment, 
Qnd is the corresponding scale factor and the subscript ind implies that the 
background rate is estimated independently of the data seen. If the background 
rate is estimated from data, then the situation is more complicated. In this 
case, the background rate is usually estimated from sideband, i.e., a region 
which is located near the signal region and which contains no signal events: 

b = Csbrisb ■ (23) 

Here Ugb is the number of events observed in the sideband, and Cs6 is the 
sideband-to-signal scale factor. Under the assumption of flat background, the 
scale factor is given by: 

a = 4^ , (24) 

where Agig is the area of the signal region, and Agb is the area of the sideband. 

The definition (23) of the background rate is, in my opinion, incorrect. It 
works only if the sideband is much larger than the signal region, and thus 
the estimate of the expected background rate b is accurate enough. However, 
in a situation when the areas of signal and sideband regions are comparable, 
Eqn. (23), combined with Eqn. (21), overestimates the significance of a signal 
for large numbers of observed events n > b, increasing the probability of 
a "discovery". This is caused by the fact that the formulas (21) and (23) 
answer the wrong question. The correct question is: what is the probability of 
observing n or a larger number of events in the signal region if the true signal 
rate is zerol To answer this question, one has to assume that all events, that 
were observed in the experiment, came from background, and therefore one 
has to redefine: 

b^CN , (25) 

where is the number of events observed in the entire region, which includes 
the signal region, sideband and, perhaps, an intermediate region between the 
signal and sideband regions, and C is the corresponding scale factor. Under 
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the assumption of flat background, the scale factor is given by: 



A 



(26) 



where A is the area of the entire region. In the same spirit, statistical signifl- 
cance, as defined via likelihood, is given by: 



i.e., it gives a number of standard deviations from the observed signal rate, 
which maximizes L{s), to the zero signal rate, under the assumption that 
—2 In L{0) / Lmax is distributed as Xi- Formula (25) should be used only to 
estimate statistical significance of a signal; for upper limit calculation one is 
not allowed to assume that all observed events come from background and one 
has to apply the standard definition(23) of an expected background rate. 

A numerical discrepancy between the definitions (23) and (25) can be illus- 
trated on the following hypothetical example. Let us assume that the back- 
ground spatial pdf is flat in the vicinity of the signal region, that the area of 
the sideband is equal to that of the signal region, and that one event is ob- 
served in the signal region and no events are observed in the sideband. Then 
an experimentalist, who uses the definition (23) of the statistical significance, 
would claim that she/he observes a very clean signal (cr = +oo), while an 
experimentalist, who uses the definition (25), will estimate the statistical sig- 
nificance of the observed signal as o" = 0.27. The latter reflects the fact that, 
due to the specific choice of the sideband and signal regions, the observed 
statistic (one event) is insufficient for positive identification of the signal. 

This example is purely hypothetical. Of course, in the situation when only 
one event is observed and the area of the sideband is comparable to that of 
the signal region, most of experimentalists would choose to quote an upper 
limit instead of a measurement. A more realistic situation occurs when the 
sideband is somewhat larger than the signal region, there are some events 
both in the signal region and in the sideband and the calculated statistical 
significance is close to three. Then one has to make a binary decision: if the 
statistical significance is larger than three, then a measurement is quoted, 
otherwise an upper limit is quoted. In this situation the choice of a specific 
procedure becomes fairly important, and the example above shows that the 
numerical discrepancy between the two approaches can be significant. 

The conclusion that 1 would like to reach in this Section is that there are 
two situations that should be treated differently. The first situation occurs 
when the background rate is estimated independently of the data seen. In this 




(27) 
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case one has to adopt the definition (22) of the background rate and therefore 
define statistical significance as: 



V 27r I A;! 



The second situation takes place when the background rate is estimated from 
the same data sample which is used to draw inference about the unknown 
signal rate. As shown above, in this case one has to adopt the definition (25). 
Statistical significance is then defined as: 

i./V/^d..£e-(^^[^. (29) 



(J k=n 



These two formulas will be used to estimate upper limits, based on the statis- 
tical significance of a signal. 



3.2 Upper Limit Calculation 



As shown in Section 3.1, the statistical significance cr of a signal is defined 
either by Eqn. (28) or by Eqn. (29). I will take Eqn. (29) as an example and 
proceed to derive a cumulative density function (cdf) P{a < a'). If Eqn. (28) 
is chosen, the derivation goes through similar steps; thus, only the final result 
will be quoted. 



When the background and signal rates are estimated from the same data 
sample, one can rewrite Eqn. (29) as: 

e-^'/'dx = f: e-^('^+"<'-) , (30) 



2n J t: k\ 



a 



where n is the number of events observed in the signal region, Uout is the 
number of events observed in the outer region, which covers the sideband and, 
perhaps, an intermediate region between the signal region and the sideband, 
and C is the corresponding scale factor. Under the assumption of fiat back- 
ground, the scale factor is given by: 



where Agig and Ag^t are the areas of the signal and outer regions respectively. 
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The random variables n and rtout are drawn from independent Poisson distri- 
butions: 

n ~ Poisson (s + 6); riout ~ Poisson (Ao„t); (32) 



where Xout is the mean of the Poisson distribution that controls the number 
of events in the outer region ^out- The best unbiased estimator of Xout is the 
number of events actually observed outside the signal region: 

Xout — 1^out,obs ) ("^"^) 

and thus the equality Xout = nout,obs is implied in the further discussion. In 
Eqn. (32) the expected number of background events b in the signal region is 
estimated in the traditional way: b — Csbnsb,obs, and its value is not used to 
determine the statistical significance of the signal. 

The cdf P{a < a') cannot be expressed in a convenient analytical form. How- 
ever, the problem can be simplified by introducing a new variable: 

p^J2 e~^("+"°"*) [C{n + nout)f ^ ^^^^ 

Hence, 

1 °° 

^ / e^^'/^dx = 1 - p ; < p < 1 . 

/2'K J 



2n 

a 

The variable p is a monotone function of a. Therefore, its cdf can be obtained 
by a one-to-one transformation: 

P{p{a)<p{a')) = P{a<a') , (35) 



and one can use the variable p instead of a to set an upper limit. The cdf of 
p is given by: 

oo ™out~l ( Q M K\'"' \nout 

l-P{p<p'\s) = J2 £ e-(.+6)li±^g-A„„*:W_ ^ ^3g) 

n=lnout=0 ^o"*" 



where n'g^^ — n'o^^ln) is the smallest non- negative integer which satisfies the 
inequality: 

k=0 ^' 
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for the given value of n. 

The infinite sum over n in Eqn. (36) converges quickly and can be easily 
calculated numerically. In a simulation, described in Section 4, the summation 
over n from 1 to 1000 was enough to achieve an accuracy of 10"^ or better for 
{l-Pip<p'\s)). 

Now one can choose the approach one would hke to use. All the techniques 
described in Sections 2.1 and 2.2 are apphcable, but now instead of a con- 
ditional pdf f{n\s) one would have to use the conditional cdf P{p < p'\s). 
The easiest thing to try is to employ the classical approach and define the 
confidence interval as (0, Sq)- In this case, through the reasoning described in 
Section 2.2, one arrives at the formula similar to Eqn. (12): 

1-q; = 1-P(p<P„6,|so) , (38) 

where 

fe=0 ^' 

is the observed value of the variable p. 

When no events are observed in the signal region {uobs — 0), one obtains 
Pohs — 0, therefore n'^^ — +00 for any n > 1, and Eqn. (38) is reduced to: 

1 - a = 1 - 6-^"°+''^ , 

which coincides with the classical expression (12). This reflects the fact that in 
this case background fluctuations are irrelevant as the statistical significance 
cr is always equal to —00, no matter how many background events we observe 
in the sideband. 

When no events are observed outside of the signal region {\out = nout^bs = 
and therefore 6 = 0), one obtains n'^^^ > 1 for any n > 1 and pobs > 0, therefore 
the sum over Ugut in Eqn. (36) is always equal to 1, and the formula (38) is 
reduced to: 

1 - a = 1 - e-'" , 

which again is identical to the classical expression (12). This reflects the fact 
that no background fluctuations are expected in this case, and hence, the 
statistical signiflcance cr is a function of the signal rate s only. 

If the background rate is estimated independently of the data seen, then one 
should start from the deflnition (28) of statistical signiflcance and repeat the 
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same logical steps to arrive at an equation similar to Eqn. (36). The random 
variables n and riind are drawn from independent Poisson distributions: 

n ~ Poisson(s + b); riind ~ Poisson ( A j„d); (40) 

where Xind is the mean of the Poisson distribution that controls the number 
of events in an independent sample which is used to estimate the background 
rate. The best unbiased estimator of X^d is the number of events actually 
observed in the independent data sample: 

Xind — 1^ind,obs i ('^■^) 

and thus the equality Xind = i^ind,obs is implied. Now the background rate used 
in the definition of statistical significance and the background rate used to 
calculate an upper limit are estimated similarly. Without losing generality, 
one can stipulate that: 

Xind — V Cind ■ (42) 

The cdf of p is now given by: 

n=\ nir,d=0 ^^^fi- 

where = ^inci(^) smallest non-negative integer which satisfies the 

inequality 

e-Cindn'i^d ^ ^, ^^^^ 

k=0 ^' 

for the given value of n. 



4 Compcirison of Veirious Approaches Using a Toy Model 

The performance of all approaches discussed in this paper is compared using 
the following toy model. The total observation region is defined as an interval 

(—10,10). The signal spatial pdf is taken to be a Gaussian with zero mean 
and unit variance, and the background spatial pdf is taken to be fiat. Under 
these conditions, the signal region is defined as (—2.5, 2.5), and the sidebands 
are defined as (—10, —5) and (5, 10). 
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The value of the expected background rate is taken to be & = 0, 1, .... 5 consec- 
utively, which corresponds to the number of events observed in the sidebands 
Usb = 0, 2, 10. The number of events in the intermediate region (—5, —2.5) 
and (2.5, 5) is set equal to the expected number b of background events in the 
signal region, since the area of the signal region is equal to that of the interme- 
diate region. Positions of events in the sideband and intermediate regions are 
generated under the assumption of a uniform spatial pdf Bi. For every value of 
the expected background rate b, upper limits obtained with various approaches 
are estimated for the number n of events in the signal region varying in integer 
steps from to 6. The results are plotted in Fig. 1. Inside the signal region all 
events are positioned precisely at zero. It is assumed that the background rate 
in this experiment is estimated from sidebands; thus, to implement the classi- 
cal method of Section 3.2, formulas (38) and (39) are used. To estimate a 90% 
CL upper limit by the Monte Carlo technique of Section 2.3, I find the value 
So of an upper limit by using the method of binary division. For every assumed 
value of So, I generate a Monte Carlo sample consisting of 50,000 experiments 
and use the formula (20) to estimate the corresponding confidence level. This 
procedure is repeated until the value of Sq that gives a 90% confidence level is 
obtained. The required accuracy for the computation of Sq is taken 10~^. For 
comparison, lower and upper bounds of 90% confidence intervals obtained by 
the procedure [17] are shown with crosses. One cross for a specific value of n 
corresponds to the situation when the lower bound is strictly zero. 

As shown in Fig. 1, at n = the Bayesian method (9) with a flat prior pdf 
and the integration of the likehhood, described in Section 2.3, always give 
2.30, while all the other approaches give somewhat smaller values. For this 
particular model, the likelihood integration technique of Section 2.3 turns 
out to be the most conservative approach, except for b = 0. The Bayesian 
method with a flat prior pdf always gives larger upper limit values than the 
Bayesian methods with the l/y/T+b and l/{s + b) priors and both classical 
approaches discussed in Sections 2.2 and 3.2. The results produced by the 
Bayesian method with the l/(s + 6) prior can be obtained by shifting the 
corresponding results obtained with a flat prior one step to the right, e.g., 
{n = 0,m = 0} and {n = l,m = 1} obviously produce the same result 
when substituted into the gamma-function r(n — m -|- 1, Sq + ^)- The classical 
approaches of Sections 2.2 and 3.2 fail to set an upper limit at n <C 6, i.e., 
when background dominates over signal. The classical procedure, based on 
the statistical signiflcance of a signal, always gives a smaller upper limit value, 
compared to the standard classical approach of Section 2.2, except at n = 0. 
In the specific situation, when the expected background is zero, the Bayesian 
method with a fiat prior pdf, both classical approaches and both hkelihood 
techniques of Section 2.3 give identical results. 
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Fig. 1. Upper limits as functions of the number n of observed events in the signal 
region under the assumptions described in the text. The expected number of back- 
ground events in the signal region takes values of 1) 6 = 0; 2) 6 = 1; 3) 6 = 2; 4) 
6 = 3; 5) 6 = 4; 6) 6 = 5. 20 



5 An Example of a CLEO Analysis: r 



In 1996 the CLEO Collaboration searched [25] for the neutrinoless decay r — > 
//7 and set an upper limit, which is, so far, the most stringent limit on the 
T — >■ /i7 branching fraction. In this analysis 3 events were observed in the 
signal region and the expected number of background events was estimated 
to be 5.5. The signal Monte Carlo and data energy- vs-mass distributions are 
shown in Fig. 2. The Bayesian approach with a fiat prior pdf was used and 
the upper hmit value was estimated as 3.6 at 90% confidence level. This value 
was divided by the integrated luminosity and efficiency factor and thus an 
upper limit of 3.0 x 10^^ at 90% confidence level was obtained for the r — /i7 
branching fraction. In Table 1 arc shown the values of upper limits for this 
analysis calculated with the alternative techniques. 
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Monte Carlo distribution. 



The value given by the procedure [17] of Fcldman and Cousins is a rough 
estimate only, since the combination of input parameters {n = 3,b = 5.5} is 
not shown in their tables. 

To implement the maximum likelihood approach of Section 2.3, the signal 
Monte Carlo two-dimensional distribution on the cncrgy-vs-mass plane was 
fitted to a bivariate Gaussian plus a non-Gaussian tail produced by initial 
and final state radiation and other effects: 
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Table 1 

Upper limits at 90% CL for the t ^ analysis of Ref. [25]. 



Method 


Upper limit at 90% CL 


Bayesian with flat prior 


3.57 


Bayesian l/\/s + b 


3.30 


Bayesian + 6) 


3.06 


classical 


1.18 


classical, based on statistical significance 


1.03 


integration of likelihood 


2.30 


Monte Carlo likelihood technique 


1.37 


Feldman & Cousins [17] 
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where A, B, a^, ge, p, mo(~ m-j-), £'o(~ Eheam), and /? are the fit parameters. 
The background spatial pdf Bi was obtained by fitting data events observed 
in the vicinity of the signal region to a linear function: 



where floi 0,1 a-iid E' are the fit parameters, and {1711,1712) and (£"1, E2) are the 
limits defining the fit region. The three events observed in the signal region 
are located far from the peak of the signal Monte Carlo distribution. Thus, the 
maximum likelihood fit treats these events as background, and the extracted 
signal rate is consistent with zero. This explains why the likelihood integration 
technique gives the value of 2.30. 
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6 Conclusion 

There is no such thing as the "best" procedure for upper hmit estimation. An 
experimentahst is free to choose any procedure she/he hkes, based on her/his 
behef and experience. The only requirement is that the chosen procedure must 
have a strict mathematical foundation. The Bayesian method with a flat prior 
and the likelihood integration technique of Section 2.3 seem to have been the 
two most popular choices in the past few years. Typically, these two approaches 
give the most conservative values of upper limits. The purpose of this note 
is to show that there are other approaches, equally justified by mathematical 
formalism, which produce less conservative estimates. 
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